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ROBUST AND SPARSE ESTIMATORS FOR LINEAR 
REGRESSION MODELS 


By Ezequiel Smucler* and Victor J. Yohai'I’ 
Universidad de Buenos Aires 

Penalized regression estimators are a popular tool for the anal¬ 
ysis of sparse and high-dimensional data sets. However, penalized 
regression estimators defined using an unbounded loss function can 
be very sensitive to the presence of outlying observations, especially 
high leverage outliers. Moreover, it can be particularly challenging 
to detect outliers in high-dimensional data sets. Thus, robust esti¬ 
mators for sparse and high-dimensional linear regression models are 
in need. In this paper, we study the robust and asymptotic proper¬ 
ties of MM-Bridge and adaptive MM-Bridge estimators: ^q-penalized 
MM-estimators of regression and MM-estimators with an adaptive 
£t penalty. For the case of a fixed number of covariates, we derive 
the asymptotic distribution of MM-Bridge estimators for all q > Q. 
We prove that for q < 1 MM-Bridge estimators can have the oracle 
property defined in Fan and Li (2001). We prove that for all t < 1 
adaptive MM-Bridge estimators can have the oracle property. The 
advantages of our proposed estimators are demonstrated through an 
extensive simulation study and the analysis of a real high-dimensional 
data set. 


1. Introduction. In this paper, we consider the problem of robust and 
sparse estimation for linear regression models. In modern regression analysis, 
sparse and high-dimensional estimation scenarios where ratio of the number 
of predictor variables to the number of observations, say p/n, is high, but 
the number of actually relevant predictor variables to the number of obser¬ 
vations, say s/n, is low, have become increasingly common in areas such as 
bioinformatics and chemometrics. In this type of regression scenarios, due 
to the high-dimensional nature of the data, it is difficult to discover outlying 
observations using simple criteria. Traditional robust regression estimators 
do not produce sparse models and can have a bad behaviour with regards to 
robustness and efficiency when p/re is high, see Maronna and Yohai (2015). 
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Moreover, they cannot be calculated for p > n. Thus, robust regression 
methods for high-dimensional data are in need. 

Modern approaches to estimation in sparse and high-dimensional linear 
regression models include penalized least squares estimators, e.g. the LS- 
Bridge estimator of Frank and Friedman (1993) and the LS-SCAD estima¬ 
tor of Fan and Li (2001). LS-Bridge estimators are penalized least squares 
estimators in which the penalization function is proportional to the iq norm 
with q > 0. They include as special cases the LS-Lasso of Tibshirani (1996) 
{q = 1) and the LS-Ridge of Hoerl and Kennard (1970) {q = 2). The LS- 
SCAD estimator is a penalized least squares estimator in which the penaliza¬ 
tion function, the SCAD, is a non-concave function with several interesting 
theoretical properties. 

The theoretical properties of penalized least squares estimators have been 
extensively studied in the past years. Of special note is the so called oracle 
property dehned in Fan and Li (2001): An estimator is said to have the oracle 
property if the estimated coefficients corresponding to zero coefficients of the 
true regression parameter are set to zero with probability tending to one, 
while at the same time the coefficient corresponding to non-zero coefficients 
of the true regression parameter are estimated with the same asymptotic 
efficiency as if we knew the correct model in advance. 

Knight and Fu (2000) derive the asymptotic distribution of LS-Bridge es¬ 
timators in the classical regression scenario of fixed p and prove that for 
q < 1 these estimators can have the oracle property. They also show that 
for q = 1, the LS-Lasso sets the estimated coefficients corresponding to zero 
coefficients of the true regression to zero with positive probability. The LS- 
Lasso estimator is not variable selection consistent unless rather stringent 
conditions are imposed on the design matrix, and thus in general does not 
posses the oracle property, see Zou (2006) and Buhlmann and van de Geer 
(2011) for details. Moreover, the LS-Lasso estimator has a bias problem: it 
can excessively shrink large coefficients. To remedy this issue, Zou (2006) 
introduced the adaptive LS-Lasso, where adaptive weights are used for pe¬ 
nalizing different coefficients of the norm of the coefficients, and showed 
that the adaptive Lasso can have the oracle property. 

We note that whereas there exist extremely efficient algorithms to cal¬ 
culate the LS-Lasso, see Buhlmann and van de Geer (2011), LS-Bridge esti¬ 
mators with q <1 seem to be somewhat difficult to calculate. An algorithm 
to calculate LS-Bridge estimators with g < 1 is described in Huang et al. 
(2008). As Zou (2006) points out, adaptive LS-Lasso estimators can be calcu¬ 
lated using any of the algorithms available to calculate LS-Lasso estimators. 

Penalized least squares estimators are not robust and may be highly ineffi- 
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dent under heavy tailed errors. In an attempt to remedy this issue, penalized 
M-estimators defined using a convex loss function have been proposed; see 
for example Wang et. al. (2007) and Li et al. (2011). Unfortunately, these 
estimators are not robust with respect to contaminations in the predictor 
variables. 

Alfons et al. (2013) proposed the Sparse-LTS estimator, a least trimmed 
squares estimator with a penalization. In a simulation study, Alfons et al. 
(2013) show that the Sparse-LTS can be robust with respect to contamina¬ 
tion in both the response and predictor variables. The Sparse-LTS estimator 
can be calculated for p > n. However, Alfons et al. (2013) do not provide any 
asymptotic theory for their estimator. Khan et al. (2007) propose a robust 
version of the LARS procedure, see Efron et al. (2004), and in extensive sim¬ 
ulations show that the RLARS procedure produces well behaved estimators 
under diverse contamination models. However, since the RLARS procedure 
is not based on the minimization of a clearly defined objective function, a 
theoretical analysis of its properties is difficult. Wang et. al. (2013) proposed 
a penalized regression estimator based on an exponential squared loss func¬ 
tion. They prove that a local minimum of the objective function used to 
define their estimator can have the oracle property. On the other hand, their 
proposed estimators cannot be calculated in regression scenarios with p > n. 
Maronna (2011) introduced S-Ridge and MM-Ridge estimators: ^ 2 -penalized 
S- and MM-estimators of regression. In extensive simulation studies he shows 
that these estimators can be robust in a variety of contamination scenarios. 
However, ^ 2 -penalized regression estimators do not produce sparse models. 
Maronna (2011) does not provide any asymptotic theory for these estima¬ 
tors. 

In this paper, we study the robust and asymptotic properties of MM- 
Bridge and adaptive MM-Bridge estimators: £g-penalized MM-estimators of 
regression and MM-estimators with an adaptive it penalty. We obtain lower 
bounds on the breakdown points of MM-Bridge and adaptive MM-Bridge 
estimators. For the case of a fixed number of covariates, we prove the strong 
consistency of MM-Bridge and adaptive MM-Bridge estimators under gen¬ 
eral conditions. We derive the asymptotic distribution of MM-Bridge esti¬ 
mators for all q and prove that for q < 1 they can have the oracle property. 
For the special case of 5 = 1 we show that the coordinates of the MM-Bridge 
estimator corresponding to null coefficients of the true regression parameter 
will be set to zero with positive probability. See the comments following The¬ 
orem 7. We show that adaptive MM-Bridge estimators can have the oracle 
property for all f < 1. We propose an algorithm to calculate both MM-Bridge 
estimators with q = 1, which we call MM-Lasso estimators, and adaptive 
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MM-Bridge estimators with t = 1, which we call adaptive MM-Lasso esti¬ 
mators. Our algorithm uses the S-Ridge estimator of Maronna (2011) as an 
initial estimator and iteratively solves a weighted-Lasso type problem. Even 
though we derive our asymptotic results for fixed p, MM-Lasso and adaptive 
MM-Lasso estimators can be calculated for p > n. In extensive simulations, 
we study the performance with regards to stability in the presence of high- 
leverage outliers, and prediction accuracy and variable selection properties 
for uncontaminated samples of the MM-Lasso and adaptive MM-Lasso esti¬ 
mators. Finally, we apply our proposed estimators to a real high-dimensional 
data set. 

The rest of this paper is organized as follows. In Section 2 we review 
the definition and some of the most important properties of MM and S- 
estimators of regression. In Section 3 we define S-Bridge, MM-Bridge and 
adaptive MM-Bridge estimators, we study their robust and asymptotic the¬ 
oretical properties and we describe an algorithm to compute MM-Lasso and 
adaptive MM-Lasso estimators. In Section 4 we conduct an extensive simu¬ 
lation. In Section 5 we apply the aforementioned estimators to a real high¬ 
dimensional data set. Conclusions are provided in Section 6 . Finally, the 
proof of all our results are given in the Appendix. 

2. MM and S-estimators of regression. We consider a linear re¬ 
gression model with random carriers: we observe (x)*",?/*) i = l,...,n, i.i.d. 
{p+ l)-dimensional vectors, where yi is the response variable and Xj G RP is 
a vector of random carriers, satisfying 

(1) Vi = xJPo+'^i for * = •••) ri, 

where oq and /3 q G RP are to be estimated and Ui is independent of Xj. 
For /3 G RP let r(/3) = (ri(/3),..., rn(/3)), where ri{(3) = Pi — xf/3. Some of 
the coefficients of (3q may be zero, and thus the corresponding carriers do 
not provide relevant information to predict y. We do not know in advance 
the set of indices corresponding to coefficients that are zero, and it may be 
of interest to estimate it. For simplicity, we will assume /3o = (/^o/®o//)> 
where j G R*, Poyi G RP“*, all the coordinates of Pq j G R^ are non-zero 
and all the coordinates of /3 q jj G RP“® are zero. 

Let Fq be the distribution of the errors Ui, Gq the distribution of the 
carriers Xj and Hq the distribution of pij ,yi). Then Hq satisfies 

(2) Ho{x,y) = Go(x)Fb(?/ - x'^/3o). 

Let X/ stand for the first s coordinates of x and let Gq,/ be its distribution. 
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For b G and g > 0 we note 


b|| 



1/9 


and ||b|| = ||b|| 2 . Throughout this paper, /?-function will refer to a bounded 
p-function, in the sense of Maronna et al. (2006). A popular choice of p- 
functions is Tukey’s Bisquare family of functions given by 



where c > 0 is a tuning constant. 

Given a sample u = (ui, ...,Un) from some distribution F and 0 < 6 < 1 
the corresponding M-estimate of scale s,i(u) is defined by 

Sn{n) = inf |s > 0 : i (^) < fej . 

It is easy to prove that Sn(u) > 0 if and only if ^{i : ttj = 0} < (1 — b)n, 
and in this case 


( 4 ) 


iv 

\Sn{u) ) 


b. 


The robustness of an estimator is measured by its stability when a small 
fraction of the observations are arbitrarily replaced by outliers that may not 
follow the assumed model. A robust estimator should not be much affected 
by a small fraction of outliers. A quantitative measure of an estimator’s 
robustness, introduced by Donoho and Huber (1983), is the hnite-sample 
replacement breakdown point. Loosely speaking, the finite-sample replace¬ 
ment breakdown point of an estimator is the minimum fraction of outliers 
that may take the estimator beyond any bound. For a regression estimator, 
this measure is defined as follows. Given a sample Zj = (x)'",pj), i = 1, ...,n, 
let Z = {zi,..., Zn} and let /3(Z) be a regression estimator. The hnite-sample 
replacement breakdown point of /3 is then dehned as 

FBP0) = —, 
n 

where 

m* = max |m > 0 : ^(Z^) is bounded for all Z^ G j 
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and Zm is the set of all datasets with at least n — m elements in common 
with Z. 

A breakdown point equal to e* only guarantees that for any given contam¬ 
ination fraction e < £*, there exists a compact set such that the estimator 
in question remains in that compact set whenever a fraction of £ observa¬ 
tions is arbitrarily modified. However, this compact set may be very large. 

Thus, although a high breakdown point is always a desirable property, an 
estimator that has a high breakdown point can still be largely affected by a 
small fraction of contaminated observations. 

Given a sample (x^, yj), i = 1,..., n from the model given in (2), Rousseeuw and Yohai 
(1984) define the S-estimator of regression as 

(5) Ps = arg min s„(r(/3)). 


where is a M-estimate of scale. Fasano et al. (2012) derive the asymptotic 
distribution of S-estimators of regression under very general conditions. S- 
estimators can always be tuned so as to attain the maximum possible finite- 
sample replacement breakdown point for regression equivariant estimators. 
However, S-estimators cannot combine high breakdown point with high ef¬ 
ficiency at the normal distribution, see Hossjer (1992). 

Let ,yi), f = 1, ...,n, be a sample satisfying (2), and po and pi be two 
p-functions satisfying pi < pq. Then Yohai (1987) defines the MM-estimator 
of regression as 


( 6 ) 


n 

^MM = argmm 

^ i=l 


( niP) \ 

\Snir0i))J ’ 


where 0i is a consistent and high breakdown point estimate of Pq and 
Sn(r(/3i)) is the M-estimate of scale of the residuals of Pi, calculated using 
po and b. 

Yohai (1987), proves that under general conditions, MM-estimators are 
strongly consistent for Pq, and furthermore 

(’^) V^0MM ~ Po) ) 


where Vx = Egq{xx^), s{Pq) is defined by 


EhoPo 


( y-^PQ \ 

V s{Pq) ) 


= h. 


a{p,FQ) = 
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and 

Besides, he shows that pi can be chosen so that the resnlting MM- 
estimator has simultaneonsly the two following properties: 

• Normal asymptotic efficiency as close to one as desired. 

• Breakdown point greater than or equal to that of the initial estimator. 

Maronna et al. (2006) recommend to take po = P^{u/co) and pi = p^{u/ci). 
The tuning constant for po, cq, should be chosen so that the resulting M- 
estimate of scale be consistent for the error standard deviation in the case 
of normal errors. The choice of ci should aim at striking a balance between 
robustness and efficiency. Maronna et al. (2006) recommend to choose ci so 
that the MM-estimator has an asymptotic efficiency of 85% at the normal 
distribution. The reason for choosing an 85% asymptotic efficiency at the 
normal distribution, is that at this level of the efficiency the MM-estimator 
has the same maximum asymptotic bias as the initial S-estimator of regres¬ 
sion for the case of normal errors and normal carriers. 


3. S-Bridge, MM-Bridge and adaptive MM-Bridge estimators 
of regression. Given a sample (x)'",?/j), i = l,...,n, 7 ^ > 0, r > 0, a p- 
function po and 0 < 6 < 1, we dehne the £r-penalized S-Bridge estimator of 
regression following Maronna (2011) as 

( 8 ) ^ps = arg min n 4(r(/3)) + InWPWl, 


where s„(r(/3)) is the residual scale estimate defined using po and b. If the 
model contains an intercept, then it is not penalized. 

It is easy to see that 

( 9 ) 0ps\\: < wPst, 

where Pg is the S-estimator calculated using po and b. 

Given another p-function pi which satisfies pi < po and > 0 we define 
the .^q-penalized MM-Bridge estimator of regression as 


( 10 ) 


n 


Pb 


= arg min 
PgRp 




f n(/3) \ 
[sn(r(Pi))J 




where P^ is a consistent initial estimate of Pq. Clearly, the robustness of the 
MM-Bridge estimator will depend heavily on the robustness of the initial 




estimate. If the model contains an intercept, then it is not penalized. For 
the case q = 1 we will call the resulting estimator MM-Lasso. 

Note that our dehnition of a MM-Bridge estimator with r = 2 and 
q = 2, is not exactly the same as the definition of MM-Ridge estimators of 
Maronna (2011). For a given A^, the MM-Ridge of Maronna (2011) is equal 
to our MM-Ridge estimator calculated with Xn/Sn{r0i))‘^■ Nonetheless, our 
asymptotic results can be very easily adapted to cover the MM-Ridge esti¬ 
mators as defined by Maronna (2011). However, this is not the case for our 
results concerning the hnite-sample breakdown point of MM-Bridge estima¬ 
tors. Maronna (2011) points out that the finite-sample breakdown point of 
MM-Ridge estimators is, for a fixed penalization parameter and according 
to his definition of MM-Ridge estimators, greater than or equal to the break¬ 
down point of the residual scale Sn- In Theorem 1, we show that for a fixed 
penalization parameter and according to our definition of MM-Bridge esti¬ 
mators, the breakdown point of any MM-Bridge estimator is greater than 
1 — 1/n. 

Given (^ > 0, t > 0 and in we define the adaptive MM-Bridge estimator 
of regression as 


( 11 ) 


n 


Pa 


= arg min 




( rm \ 

\sn{r0i)) J 


+ tn ^ 


1 / 3 . 


^ I/32jI 


where f32 is a consistent initial estimate of /3g. Clearly if /32j = 0 for some j, 
then 0A,j — 0- H model contains an intercept, then it is not penalized. 
For the case t = 1 we will call the resulting estimator adaptive MM-Lasso. 
Note that for coefficients corresponding to large coefficients of /32, the adap¬ 
tive MM-Lasso employs a small penalty; this ameliorates the bias issues 
associated with the ii penalty. 

Wang et. al. (2013) prove that their estimator can have the highest possi¬ 
ble breakdown point among regression equivariant estimators, but it must be 
noted that their estimator is not regression equivariant. Alfons et al. (2013) 
show that the breakdown point of the Sparse-LTS estimator is (n —/i-|-l)/n, 
where n — h is the number of trimmed observations, and prove that the 
breakdown point of the LS-Lasso estimator is 1/n. Note that it follows im¬ 
mediately from (9) that for any 7 „, the finite-sample breakdown point of 
is at least as high as that of Pg. In Theorem 1, we prove that for any 
fixed \n > 0, the breakdown point of Pp is equal to 1 — 1/n. In Theorem 
2 , we prove that for any fixed in > 0, the breakdown point of Pa is greater 
than or equal to the breakdown point of P 2 - However, one could argue that 
since Pps-, Pb Pa are not regression equivariant, these results are rather 
vacuous. See Davies and Gather (2006). 
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Theorem 1. //A„ > 0 is fixed, then FBP{^^) > 1 — 1/n 

Theorem 2. Iftn>0 is fixed, then FBP{Pjfi) > FBP02)- 

Note that if 02 = then FBP{0jfi > 1 — 1/n whenever A^, in > 0. In 
practice, 7 „, A„ and in may be chosen via some data-driven procedure such 
as cross-validation. In this case, the breakdown point of the resulting MM- 
Bridge and adaptive MM-Bridge estimators may be lower than 1 — 1/n. The 
robustness of the resulting estimators will depend sorely on the robustness 
of the cross-validation scheme, and hence the use of robust residual scales 
as objective functions, instead of the classical root mean squared error, is 
crucial. 

3.1. Asymptotics. We now describe the set-up to study the asymptotic 
properties of S-Bridge, MM-Bridge and adaptive MM-Bridge estimators of 
regression. We will assume that 

Bl. pq and pi are twice continuously differentiable and eventually constant. 

B2. Pgq (x^/3 = O) < 1 — 6 for all non-zero /3 G M^. 

B3. To has an even continuous density, /o, that is a monotone decreasing 
function of |u| and a strictly decreasing function of |u| in a neighbor¬ 
hood of 0 . 

A family of p-functions that satisfies [Bl] is Tukey’s Bisquare family of func¬ 
tions, given in (3). Condition [B2] is needed in the proof of the consistency 
of S-Bridge estimators. Note that condition [B3] does not require finite mo¬ 
ments from Tq. Thus, extremely heavy tailed error distributions, such as 
Cauchy’s distribution, can be easily seen to satisfy [B3]. However, [B3] does 
impose a rather stringent symmetry assumption on the error distribution. 
This requirement greatly simplifies the asymptotic treatment of the estima¬ 
tors and is usual in robust statistics. 

The following theorem proves the strong consistency of S-Bridge, MM- 
Bridge and adaptive MM-Bridge estimators of regression whenever 7 ^ = 
o(n), Xn = o{n) and in = o(n) respectively. 

Theorem 3. Let 0i[,yi), i = l,...,n, be i.i.d observations with distri¬ 
bution Hq, which satisfies (2). Assume [B1]-[B3] hold. Then 

(i) If In = o{n), 0PS Pq. 

(ii) If Xn = o{n), 0p /3o. 

fiiij If in = o{n), 0A Po- 
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In practice, we will use the S-Ridge estimator of Maronna (2011) as the 
initial estimate f3i in (10) and (11). Note that according to Theorem 3 and 
the remarks above Theorem 1, the S-Ridge is a high breakdown point and 
consistent estimate of /3 q, as long as the penalization parameter satishes 

In = 0(v^)- 

In order to obtain the rate of convergence of MM-Bridge and adaptive 
MM-Bridge estimators we will have to make the following additional as¬ 
sumption: 

B4. Go has hnite second moments and Vx = is non-singular. 

In the next theorem, we prove the -yn-consistency of MM-Bridge and 
adaptive MM-Bridge estimators. 

Theorem 4. Let i = l,...,n, he i.i.d observations with distri¬ 

bution Hq, which satisfies (2). Assume [B1]-[B4] hold. Then 

(i) If Xn = 0{^/n), then \\pB “ M = Op{l/^/n). 

(a) If Ln = 0{y/n), then \\Pa - ^o\\ = Op{l/^/n). 

Remark 1. From now on, we will assume that the initial estimator used 
to define the penalty weights for the adaptive MM-Bridge estimator, $ 2 : 
y/n-consistent. For example, according to Theorem f, we could take (32 
be some MM-Bridge estimator calculated with Xn = 0{^/n). 

Let /3^ j stand for the first s coordinates of and jj for the remaining 
p — s. Let j stand for the hrst s coordinates of and (Bp jj for the 
remaining p — s. The following theorem shows that, as long as g > t — 1 and 
t < 1, adaptive MM-Bridge estimators can be variable selection consistent, 
and that if g < 1, then MM-Bridge estimators can be variable selection 
consistent as well. In particular, taking t = 1, we prove the variable selection 
consistency of adaptive MM-Lasso estimators. 

Theorem 5. Let (x?’,?/*), i = l,...,n, be i.i.d observations with distri¬ 
bution Hq, which satisfies (2). Assume [B1]-[B4] hold. 

(i) Suppose q < 1, Xn = 0{^/n) and Xnjn^^’^ —)• oo. Then 

P {^ 3,11 = > 1. 

(a) Suppose t < 1, Ln = 0{y/n) and Lnn^‘'~^'^^‘^ oo. Then 

I® {Pa,!! = Op-s) —^ 1. 
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Next we derive the asymptotic distribution of 0^ j and 

Theorem 6. Let i = be i.i.d observations with distri¬ 

bution Hq, which satisfies (2). Assume [B1]-[B4] hold. 

(i) Suppose q < I, \nl^fn —^ 0 and ^ oo. Then 

Vn0B,i - Po,i) ^0, • 

(a) Suppose t < 1, Lnj \pn — ^ 0 and — >• oo. Then 

MiiA.,-»«.,) ^ N. . 

Here a{'ip,F()) and b{'ijj,FQ) are as in (7) and = Eco^iy^■ 


Theorem 5 together with Theorem 6 prove that /3^ and /3g can have the 
oracle property as long as o > t — 1 and t < 1, and q < 1 respectively. That 
is: the estimated coefficients corresponding to null coordinates of the true 
regression parameter are set to zero with probability tending to 1, while at 
the same time the coefficients corresponding to non-null coordinates of the 
true regression parameter are estimated with the same asymptotic efficiency 
as if we had applied a non penalized MM-estimators to the relevant carriers 
only. 

In Theorem 7 we derive the asymptotic distribution of for g > 1. Our 
theorem is analogous to Theorem 2 of Knight and Fu (2000). 


Theorem 7. Let {'x.f^yi), i = l,...,n, be i.i.d observations with dis¬ 
tribution Hq, which satisfies (2). Let q > 1. Assume [B1]-[B4-] hold and 
Xn/Vn^Xo. Then 

Vn0B - Po) argmm(H), 


where 

R{z) 


-z'^W-L 


1 ^ 

2s[j3 Fo)z^V^z -L Xoq ^ Zjsgn{(3oj)\/3oj 


g-l 


for q> 1, 


R{z) = -z'^W -L ^ b(-0o,Fo)z'^I4z -L Xo^{zjsgn{l3oj)I{l3oj / 0) 

+\zj\I{/3Qj = 0 )), 


for q = l and W ~ IVp (O, a(?/:o, Tb)/s(/3o)^I4). 
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Note that if Xq = 0, has the same asymptotic distribution as the corre¬ 
sponding non-penalized MM-estimator. If Aq > 0 and q = 1, the coordinates 
of /3g corresponding to null coefficients of /3 q will be set to zero with posi¬ 
tive probability, the proof is essentially the same as the one that appears in 
pages 1361-1362 of Knight and Fu (2000). However, one can show that 

limsupP (^Pb jj = < c < 1 , 

where c depends on Xq and /3 q. The proof is essentially the same as the 
proof of Proposition 1 of Zou (2006). 

If q > I the amount of shrinkage of the estimated regression coefficients 
increases with the magnitude of the true regression coefficients. Hence, for 
’’large” parameters, the bias introduced by MM-Bridge estimators with q > 
1 may be unacceptably large, at least for the fixed p scenario. For the case 
q = 2 we can calculate the asymptotic distribution of the estimator explicitly. 
It follows easily from Theorem 7 that the asymptotic distribution of the 
MM-Ridge estimator is 


Np ( —2Ao 


sjPof 


Po,s{Po) 


2 «(V’i 


Fo, 


-1 


In the next theorem we derive the asymptotic distribution of /3^ for q < 1 
when Xnjrfi^'^ —)• Aq. 


Theorem 8. Let (x)^,y*), i = l,...,n, he i.i.d observations with dis¬ 
tribution Hq, which satisfies (2). Let q < 1. Assume [B1]-[B4] hold and 
—)• Aq. Then 


y/n{PB - Po) argmin(i?), 

where 

1 ^ 

R{z) = -z'^W -h . b{po,Fo)z^V^z + Aq V = 0), 

and Wr^Np (0, a{Po, Fq)/ s{Po)^V ^). 

It follows from Theorem 8 that for g < 1, if Aq > 0, the coordinates of 
Pb corresponding to null coefficients of Pq will be set to zero with positive 
probability. Moreover, in this case the shrinkage only affects the coordinates 
of the estimators corresponding to null coefficients of Pq, and hence no 
asymptotic bias is introduced. 
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3.2. Computation. In this section, we describe an algorithm to obtain 
approximate solutions of (10) for q = 1, i.e. MM-Lasso estimators. Through 
out this section we will assume that our model, (1), contains an intercept, 
and that the hrst coordinate of each x.j equals 1. Let X be the matrix with 
Xj as rows. 

Prior to any calculations, all the columns of X, except the first one, are 
centered and scaled using the median and the normalized median absolute 
deviation respectively. The response vector y is centered using the median. 
At the end, the final estimates are expressed in the original coordinates. 

We take the S-Ridge estimator of Maronna (2011), which we note (3ps, as 
the initial estimate in (10). The penalization parameter for S-Ridge estima¬ 
tor, 7 „, is chosen via robust 5-fold cross-validation, as described in Maronna 
(2011). Let Sn = Sn{r0ps)). 

Let 'w{u) = 'ijji{u)/u, where -01 is the derivative of pi. For a given f3, let 
LOi = w{ri{(3)/sn). Suppose A„ is given, and let W be the diagonal matrix 
formed by ^/uJi ,Let y* = Wy and X* = WX. Let (3p be the 
MM-Lasso estimator. It is easy to show that 0p satisfies 


X*^(y* - X*/3) + A, 


0 

signih) 


sign{Pp+i) 


) 


= 0 


'p+D 


where = Op+i stands for a change of sign. Note that 
equals k* = (yCJi ,For each j = 2, ...,p-|- 
column of X* and let 


k*rx*(i) 


= 


the first column of X* 
1 let be the j-th 


Then can be decomposed as the sum of two vectors: r/jk*, in the di¬ 
rection of k*, and — r/jk*, orthogonal to k*. Let X*-*- be the 

matrix with columns x*-’-^^), ...,x*-’-(^’'’'^). It is easy to show that satisfies 


( 12 ) 

(13) 


k*y* - ||k*|| (/?! + V2l32 + ••• + Pp+il3p+i) = 0 


-X*^^ f3) + Xns. 


*±Tf 


0 

sign{l32) 


sign{0p+i) 


) 


= 0 


P+1- 


We note that if k*,y* and X*-*-^ where known, Pp 2 ^---i ^B,p+i could be 
estimated by solving equation (13) using some algorithm to solve Lasso-type 
problems, e.g. the LARS procedure or Coordinate Descent Optimization, 
without including an intercept. Then fip i could be solved easily from (12). 
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The fact that k*,y* and depend on (3^ suggests an iterative pro¬ 

cedure, as is usual in robust statistics. Starting from j3pg we iteratively 
solve equation (13) using the LARS algorithm without including an inter¬ 
cept and then solve for the intercept in ( 12 ). Call the estimate at the 
z-th iteration. Convergence is declared when 


where 5 is some fixed tolerance parameter. In our simulations we took 5 = 
10 “^. 

Regarding the calculation of adaptive MM-Lasso estimators, we note that 
solving ( 11 ) is equivalent to solving 

i=l ' ^ 

where Xjj = Xjj|/32j|‘' for j = l,...,p and taking 13a, j = /3j|/32,j|‘'. Hence, 
our procedure to calculate MM-Lasso estimators can be used to calculate 
adaptive MM-Lasso estimators, simply applying the routine to the data with 
weighed carriers. To calculate our proposed adaptive MM-Lasso estimator, 
we take = /3p5, 02 — 0B 9’^'^ ? = 1- 

In practice, we chose the p-functions used to calculate the initial S-Ridge 
estimator, the MM-Lasso estimator and the adaptive MM-Lasso estimator 
of the form po = Pco Pi — Pci where ci > cq and is as in (3). The 
tuning constants cq and ci are chosen as in Maronna (2011). 

The penalization parameter for 0p, A„, is chosen over a set of candidates 
via robust 5-fold cross validation, using a r-scale of the residuals as the 
objective function. The r-scale was introduced by Yohai and Zamar (1988) 
to measure in a robust and efficient way the largeness of the residuals in a 
regression model. The set of candidate lambdas is taken as 30 equally spaced 
points between 0 and Xmax, where Xmax is approximately the minimum A 
such that all the coefficients of 0p except the intercept are zero. To estimate 
Xmax we first robustly estimate the maximal correlation between y and the 
columns of X using bivariate winsorization as advocated by Khan et al. 
(2007). We use this estimate as an initial guess for Xmax and then improve 
it using a binary search. If p > n, then 0 is excluded from the candidate 
set. The penalization parameter for /3^, in, is chosen using the same scheme 
used to choose A„. 

The initial S-Ridge estimate is calculated using our own adaption of 
Maronna’s MATLAB code to C-|— 1 -. To solve equation (13) we use the Fast- 
Lasso() function from the robustHD R package (Alfons (2014)). We use the 
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foreach R (Revolution Analytics and Weston (2013)) package for parallel 
computations when it comes to finding optimal penalization parameters via 
cross-validation. This provided a signihcant reduction in computing times 
in computers with several cores. Extensive parts of our computer code are 
written in C++ and interfaced with R using the RcppArmadillo package 
(Eddelbuettel and Sanderson (2014)). An R package that includes the func¬ 
tions to calculate the estimators we propose is available at http: //esmucler. github. io/mmlasso/. 

4. Simulations. In this section, we compare the performance with re¬ 
gards to prediction accuracy and variable selection properties of 

• The MM-Lasso estimator described in the previous section. 

• The adaptive MM-Lasso estimator described in the previous section. 

• The Sparse-LTS of Alfons et al. (2013). The penalization parameter 
for this estimator is chosen using a BIC-type criterion as advocated 
by the authors. The estimator was calculated using the sparseLTS() 
function from the robustHD R package. 

• The LS-Lasso estimator. The penalization parameter for this estimator 
was chosen using 5-fold cross validation using the sum of the squared 
residuals as the objective function. The estimator was calculated using 
the lars() function from the lars R package (Hastie and Efron (2013)). 

• The adaptive LS-Lasso estimator. The weights used were the recip¬ 
rocal of an initial LS-Lasso estimator, calculated as above. Both the 
initial and the final penalization parameters were chosen using 5-fold 
cross validation using the sum of the squared residuals as the objective 
function. The estimator was calculated using the lars() function from 
the lars R package. 

• The Maximum Likelihood Oracle estimator, that is, the Maximum 
Likelihood estimator applied to the relevant carriers only. When the 
errors follow a normal distribution, this is the Least Squares estimators 
applied to the relevant carriers only. Note that in any case, this is not 
a feasible estimator, and is included for benchmarking purposes only. 

• For the contaminated scenarios, we will also include the Oracle MM es¬ 
timator: an MM-estimator, calculated with Tukey’s bisquare function 
and tuned to have 85% normal efficiency, applied to the relevant car¬ 
riers only. The estimator was calculated using the lmRob() function 
from the robust R package (Konis et al. (2014)). Once again, note 
that this is not a feasible estimator, and is included for benchmarking 
purposes only. 
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4.1. Scenarios. To evaluate the estimators we generate two independent 
samples of size n of the model y = x^/3q + u. The first sample, called 
the training sample, is used to £t the estimates and the second sample, 
called the testing sample, is used to evaluate the prediction accuracy of 
the estimates. We considered three possible distributions for the errors: a 
zero mean normal distribution. Student’s t-distribution with three degrees 
of freedom (t(3)) and Student’s t-distribution with one degree of freedom 
(t(l)). The hrst case corresponds to the classical scenario of normal errors, 
the second case has heavy-tailed errors and the third case has extremely 
heavy-tailed errors. For the first two cases we use the prediction root mean 
squared error (RMSE) to evaluate the prediction accuracy of the estimates. 
For the third case, since Student’s t-distribution with one degree of freedom 
does not have a finite first moment, we use the median of the absolute 
value (MAD) of the prediction residuals as a measure of the the estimators 
prediction accuracy. We also evaluate the variable selection performance of 
the estimators by calculating the false negative ratio (FNR), that is, the 
fraction of coefficients erroneously set to zero, and the false positive ratio 
(FPR), the fraction of coefficient erroneously not set to zero. 

We consider the following five scenarios for the sample size, the number 
of covariates, (3q and the distribution of the carriers. 

1. We take p = 8, n = AO and (3q given by: component 1 is 3, component 

2 is 1.5, component 6 is 2 and the rest of the coordinates are set to 
zero. We take x ^ iVp(0,5]) with Sjj = with p = 0.5. For the 

case of normally distributed errors, we take the standard deviation of 
the errors to be u = 3. 

2. The same as the last one, but with n = 60 and a = 1. 

3. We take p = 30, n = 100 and Pq given by: components 1-5 are 2.5, 
components 6-10 are 1.5, components 11-15 are 0.5 and the rest are 
zero. We take x ^ Np{0, S) with Sjj = with p = 0.95. For the 
case of normally distributed errors, we take the standard deviation of 
the errors to be cr = 1.5. 

4. We take p = 200, n = 100 and Pq given by: components 1-5 are 2.5, 

components 6-10 are 1.5, components 11-15 are 0.5 and the rest are 
zero. The first 15 covariates (xi, ...,xi 5 ) and the remaining 185 covari¬ 
ates (xi6, ...,X2oo) are independent. The first 15 covariates have a zero 
mean multivariate normal distribution. The pairwise correlation be¬ 
tween the ith and jth components of (xi, ...jXis) is with p = 0.5 
for i,j = 1,...,15. The final 185 covariates have a zero mean mul¬ 
tivariate normal distribution. The pairwise correlation between the 
ith and jth components of (xie, X200) is with p = 0.5 for 
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i,j = 16, ...,200. For the case of normally distributed errors, we take 
the standard deviation of the errors to be a = 1.5. 

5. The same as the last one, but with p = 0.95. 

6. The same as Scenario 1, but with p = 250 and n = 50. 

In Scenario 1 we have a moderately high p/n ratio. In Scenario 2 we have 
a relatively low p/n ratio. In Scenario 3 we have p < n and high p/n ratio 
and in Scenarios 4, 5 and 6 we have p > n. Scenarios 1 and 2 were analysed in 
Tibshirani (1996) and Fan and Li (2001). Scenarios 3, 4 and 5 were analysed 
in Huang et al. (2008). 

To evaluate the robustness of the estimators for the case of high-leverage 
outliers, we introduce contaminations in all six scenarios, for the case of 
normal errors. Note that we only contaminate the training sample and not 
the testing sample. We take m = [O.ln] and for i = 1, ..,m we set y* = 5yo 
and Xj = (5, ...,0). We moved yo in an uniformly spaced grid between 0 
and 3 with step 0.1 and then between 3 and 10 with step 1. To summarize 
the results for the contaminated scenarios we report for each estimator the 
maximum RMSE, FNR and FPR over all outlier sizes yo- We note that 
the RMSEs of the LS-Lasso and the adaptive LS-Lasso are unbounded as a 
function of the outlier size and thus the range of outlier sizes considered aims 
at finding the maximum RMSE of the MM-Lasso, the adaptive MM-Lasso 
and the Sparse-LTS. 

The number of Montecarlo replications for the uncontaminated scenarios 
was M = 500. The number of Montecarlo replications for contaminated 
scenarios was reduced to M = 100, to keep computation times reasonably 
low. 
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4.2. Results. We now present the results of our simulation study. All 
results are rounded to two decimal places. Table 1 shows the results for 
Scenarios 1 through 6 without contamination. 

Regarding the prediction accuracy of the estimators, for the case of normal 
errors, the MM-Lasso and the adaptive MM-Lasso have a RMSE of the same 
order, and at times even lower than that of Lasso and the adaptive Lasso. 
The Sparse-LTS shows a good behaviour in Scenarios 1, 2 and 5, but its 
RMSE is much larger than that of the other estimators for the remaining 
scenarios. For the case of errors with t(3) or t(l) distribution, the MM-Lasso 
and the adaptive MM-Lasso show the best overall performance. We were 
surprised by the fact that for t(3) errors, the Lasso and the adaptive Lasso 
have a reasonably low RMSE when compared with the maximum likelihood 
oracle. As expected, the Lasso and the adaptive Lasso lose all predictive 
power when the errors have a t(l) distribution. Except for Scenarios 3, 4 
and 6, the Sparse-LTS shows a reasonably good performance. Regarding the 
variable selection properties of the estimators, we note that the FPR and 
the FNR of the MM-Lasso are comparable to that of the Lasso, and the 
FPR and FNR of the adaptive MM-Lasso are comparable to that of the 
adaptive Lasso for the case of normal errors. For errors with t(3) or t(l) 
distribution, the MM-Lasso and the adaptive MM-Lasso generally show the 
best behaviour. The FPR of the adaptive MM-Lasso is lower than that of 
the MM-Lasso, but the price to pay for this improvement is an increase in 
the FNR. Note that for Scenarios 1, 2 and 3 the Sparse-LTS has a rather 
high FPR, always greater than 0.5. 

In Table 2 we show the results for Scenarios 1 through 6 under high- 
leverage contamination. The MM-Lasso and the adaptive MM-Lasso show 
the best overall behaviour. The Sparse-LTS shows a good behaviour for 
Scenarios 1, 2, and the best behaviour for Scenario 5, but its maximum 
RMSE is much larger than that of the MM-Lasso and the adaptive MM- 
Lasso for the rest of the scenarios. As expected, the maximum RMSE of the 
Lasso and the adaptive Lasso is very large in all cases. In Figure 1 we show 
the RMSEs of the estimators as a function of the outlier size for Scenario 
3. The MM-Lasso has the overall best behaviour, followed closely by the 
adaptive MM-Lasso. Note that the RMSE curves of the Lasso and of the 
adaptive Lasso are unbounded as a function of the outlier size: by taking 
larger outlier sizes the maximum RMSEs of the MM-Lasso, the adaptive 
MM-Lasso and the Sparse-LTS would not change, but those of the Lasso and 
the adaptive Lasso would increase without bound. Regarding the variable 
selection properties of the estimators, the MM-Lasso and the adaptive MM- 
Lasso show the best overall balance between a low FNR and a low FPR. 


RMSE 
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Note that for Scenarios 1, 2, 3 the maximum FPR of the Sparse-LTS is very 
high. 



yo 


Fig 1 . RMSEs as a function of outlier sizes for each of the estimators for the third scenario, 
with p — 30, n = 100, normal errors and 10% contamination. RMSEs are averaged over 
100 replications. 
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Scenario 

Normal 



tiff) 



til) 




RMSE 

FNR 

FPR 

RMSE 

FNR 

FPR 

MAD 

FNR 

FPR 

1 (n,p) = (40,8) 

MM-Lasso 

3.42 

0.04 

0.52 

1.77 

0 

0.52 

1.36 

0.01 

0.50 

adaptive MM-Lasso 

3.43 

0.09 

0.27 

1.75 

0 

0.20 

1.32 

0.02 

0.21 

Sparse-LTS 

3.92 

0.03 

0.82 

1.91 

0 

0.85 

1.44 

0 

0.69 

Lasso 

3.33 

0.02 

0.43 

1.84 

0 

0.46 

9.9 

0.38 

0.28 

adaptive Lasso 

3.28 

0.06 

0.26 

1.82 

0.01 

0.29 

10 

0.46 

0.19 

Oracle 

3.15 

0 

0 

1.69 

0 

0 

1.16 

0 

0 

2 (n,p) = (60,8) 

MM-Lasso 

1.09 

0 

0.53 

1.77 

0 

0.51 

1.21 

0 

0.46 

adaptive MM-Lasso 

1.07 

0 

0.21 

1.75 

0 

0.18 

1.18 

0 

0.16 

Sparse-LTS 

1.16 

0 

0.69 

1.76 

0 

0.67 

1.24 

0 

0.57 

Lasso 

1.07 

0 

0.48 

1.82 

0 

0.46 

5.38 

0.39 

0.29 

adaptive Lasso 

1.07 

0 

0.31 

1.73 

0 

0.31 

5.54 

0.47 

0.19 

Oracle 

1.04 

0 

0 

1.72 

0 

0 

1.10 

0 

0 

3 (n,p) = (30,100) 

MM-Lasso 

1.69 

0.13 

0.21 

1.75 

0.10 

0.27 

1.28 

0.15 

0.17 

adaptive MM-Lasso 

1.77 

0.26 

0.09 

1.80 

0.21 

0.09 

1.39 

0.29 

0.06 

Sparse-LTS 

2.25 

0 

1 

2.14 

0 

1 

1.78 

0.01 

0.97 

Lasso 

1.74 

0.11 

0.22 

1.90 

0.12 

0.27 

10.7 

0.55 

0.21 

adaptive Lasso 

1.74 

0.21 

0.13 

1.94 

0.22 

0.14 

10.7 

0.72 

0.09 

Oracle 

1.63 

0 

0 

1.73 

0 

0 

1.27 

0 

0 

4 (n,p) = (100,200) 

MM-Lasso 

1.90 

0.02 

0.12 

2.02 

0.02 

0.1 

1.90 

0.08 

0.09 

adaptive MM-Lasso 

1.78 

0.08 

0.02 

1.89 

0.07 

0.01 

1.71 

0.15 

0.03 

Sparse-LTS 

3.32 

0.14 

0.11 

3.17 

0.12 

0.01 

2.43 

0.13 

0.12 

Lasso 

1.96 

0.02 

0.23 

2.19 

0.02 

0.23 

7.79 

0.51 

0.1 

adaptive Lasso 

2.16 

0.04 

0.15 

2.41 

0.05 

0.16 

9.10 

0.56 

0.07 

Oracle 

1.64 

0 

0 

1.77 

0 

0 

1.28 

0 

0 

5 (n,p) = (100,200) 

MM-Lasso 

1.92 

0.16 

0.08 

1.89 

0.11 

0.06 

1.42 

0.16 

0.05 

adaptive MM-Lasso 

1.94 

0.29 

0.03 

1.89 

0.22 

0.02 

1.47 

0.31 

0.01 

Sparse-LTS 

1.89 

0.13 

0 

1.98 

0.11 

0 

1.47 

0.15 

0 

Lasso 

1.88 

0.11 

0.21 

2.12 

0.12 

0.22 

6.33 

0.57 

0.10 

adaptive Lasso 

2.06 

0.18 

0.13 

2.29 

0.19 

0.14 

6.75 

0.70 

0.10 

Oracle 

1.64 

0 

0 

1.77 

0 

0 

1.29 

0 

0 

6 (n,p) = (50,250) 

MM-Lasso 

4.05 

0.12 

0.07 

1.99 

0 

0.06 

2.05 

0.08 

0.05 

adaptive MM-Lasso 

3.99 

0.18 

0.03 

1.80 

0.01 

0.01 

1.79 

0.12 

0.02 

Sparse-LTS 

4.72 

0.26 

0.12 

2.60 

0.04 

0.09 

2.22 

0.07 

0.11 

Lasso 

3.67 

0.05 

0.07 

2.04 

0.01 

0.07 

30.5 

0.62 

0.03 

adaptive Lasso 

3.97 

0.06 

0.06 

2.26 

0.01 

0.06 

31.3 

0.64 

0.02 

Oracle 

3.13 

0 

0 

1.67 

0 

0 

1.12 

0 

0 


Table 1 

Results for all the simulation scenarios, with normal, t{3) and t(l) distributed errors. 
RMSE, MAD, FNR and FPR, averaged over 500 replications are reported for each 

estimator. 
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Finally, we calculated the computing times of the adaptive MM-Lasso, 
the MM-Lasso and the Sparse-LTS for several of the considered scenarios, 
for the case of normal errors and no contamination. Since the computing 
times for the adaptive MM-Lasso and the MM-Lasso were very similar, we 
only report the results for the adaptive MM-Lasso. Computing times were 
averaged over 5 replications and calculations were performed on R 3.0.2 on 
a 3.07x4 GHz Intel Core i7 PC. We see that in Scenarios 1, 3 and 5 the 
Sparse-LTS is considerably faster than the adaptive MM-Lasso. However, in 
Scenario 6, the adaptive MM-Lasso is 3 times faster than the Sparse-LTS. 
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Scenario 

Max. RMSE 

Max. FNR 

Max. FPR 

1 (n,p) = (40,8) 

MM-Lasso 

4.38 

0.11 

0.57 

adaptive MM-Lasso 

4.43 

0.25 

0.32 

Sparse-LTS 

4.92 

0.07 

0.95 

Lasso 

5.78 

0.27 

0.49 

adaptive Lasso 

6.14 

0.36 

0.33 

Oracle MM 

3.71 

0 

0 

2 (n,p) = (60,8) 

MM-Lasso 

1.39 

0 

0.59 

adaptive MM-Lasso 

1.38 

0.01 

0.36 

Sparse-LTS 

1.42 

0 

0.92 

Lasso 

4.89 

0.19 

0.56 

adaptive Lasso 

5.13 

0.25 

0.38 

Oracle MM 

1.21 

0 

0 

3 (n,p) = (100,30) 

MM-Lasso 

2.02 

0.20 

0.35 

adaptive MM-Lasso 

2.11 

0.36 

0.21 

Sparse-LTS 

3.18 

0 

1 

Lasso 

3.05 

0.25 

0.26 

adaptive Lasso 

3.24 

0.41 

0.15 

Oracle MM 

2.09 

0 

0 

4 (n,p) = (100,200) 

MM-Lasso 

4.14 

0.13 

0.24 

adaptive MM-Lasso 

4.02 

0.21 

0.12 

Sparse-LTS 

5.25 

0.28 

0.15 

Lasso 

6.74 

0.31 

0.21 

adaptive Lasso 

7.96 

0.40 

0.14 

Oracle MM 

2.09 

0 

0 

5 (n,p) = (100,200) 

MM-Lasso 

2.48 

0.21 

0.15 

adaptive MM-Lasso 

2.72 

0.37 

0.05 

Sparse-LTS 

2.14 

0.22 

0 

Lasso 

20.25 

0.64 

0.15 

adaptive Lasso 

13.03 

0.79 

0.06 

Oracle MM 

2.09 

0 

0 

6 (n,p) = (50,250) 

MM-Lasso 

4.97 

0.36 

0.08 

adaptive MM-Lasso 

5.08 

0.45 

0.04 

Sparse-LTS 

5.40 

0.47 

0.11 

Lasso 

6.04 

0.42 

0.07 

adaptive Lasso 

7.89 

0.45 

0.06 

Oracle MM 

3.68 

0 

0 


Table 2 

Results for all the scenarios with normal errors and 10% contaminated observations. 
Maximum RMSEs, FNRs and FPRs over all outlier sizes are averaged over 100 

replications. 
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Scenario 

adaptive MM-Lasso 

Sparse-LTS 

1 (n,p) = (40,8) 

3.33 

0.7 

3 (n,p) = (100,30) 

7.35 

1.71 

5 (n,p) = (100,200) 

41.75 

28.51 

6 (n,p) = (50,250) 

8.05 

25.89 


Table 3 

Computing times in seconds for the adaptive MM-Lasso and the Sparse-LTS, averaged 

over 5 replications. 


5. A real high-dimensional data set. In this section, we analyse a 
data set corresponding to electron-probe X-ray microanalysis of archaeologi¬ 
cal glass vessels, where each of n = 180 glass vessels is represented by a spec¬ 
trum on 1920 frequencies. For each vessel the contents of thirteen chemical 
compounds are registered. This data set appears in Janssens et al. (1998), 
and was previously analysed in Maronna (2011). We fit a linear model where 
the response variable is the content of the 13th chemical compound (PbO) 
and the carriers are the 1920 frequencies measures on each glass vessel. Since 
for frequencies below 15 and above 500 the values of Xij are almost null and 
show very little variability, we keep frequencies 15 to 500, so that we have 
p = 486. We apply the MM-Lasso, the adaptive MM-Lasso, the Sparse-LTS, 
the Lasso and the adaptive Lasso estimators to the data. 

The MM-Lasso selects seven variables: the 28th, 14:5th, 337th, 338th, 
372nd, 374th and 403rd frequencies. The adaptive MM-Lasso selects four 
variables: the 28th, 145th, 337th and 374th frequencies. Thus, the adap¬ 
tive MM-Lasso drops three of the variables selected by the MM-Lasso. The 
Sparse-LTS selects three variables: the 338th and 403rd and 466 frequencies. 
The Lasso selects 70 variables, the adaptive Lasso selects 49. Hence, all three 
robust estimators produce models that are sparser and easier to interpret. 

To asses the prediction accuracy of the estimators, we used 5-fold cross- 
validation. The criterion used was a r-scale of the residuals, calculated as in 
Maronna and Zamar (2002). The adaptive MM-Lasso and the Lasso show 
the best behaviour by far, followed by the Lasso, the adaptive Lasso and the 
Sparse-LTS, in that order. 
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r-scale 

MM-Lasso 

0.086 

adaptive MM-Lasso 

0.083 

Sparse-LTS 

0.329 

Lasso 

0.131 

adaptive Lasso 

0.138 


Table 4 

Cross-validated r-scale of the residuals of each of the estimators for the electron-probe 

X-ray microanalysis data. 


6. Conclusions. We have studied the robust and asymptotic properties 
of MM-Bridge and adaptive MM-Bridge regression estimators. We proved 
that, for the case of a fixed nnmber of covariates, MM-Bridge estimators can 
have the oracle property defined in Fan and Li (2001) whenever q < 1. We 
proved that adaptive MM-Bridge estimators can have the oracle property 
for all f < 1. We also derived the asymptotic distribution of the MM-Ridge 
estimator of Maronna (2011). 

We proposed an algorithm to calcnlate both the MM-Lasso and the adap¬ 
tive MM-Lasso. Onr simnlation study suggests that, at least for the scenarios 
considered, the proposed MM-Lasso and adaptive MM-Lasso estimators pro¬ 
vide the best balance between prediction accnracy and sparse modelling for 
uncontaminated samples, and stability in the presence of ontliers. The adap¬ 
tive MM-Lasso reduces the false positive ratio of the MM-Lasso, with the 
unpleasant, and foreseeable, side effect of an increase in the false negative 
ratio. We note that even though we derived our asymptotic results for the 
case of a fixed number of covariates, the MM-Lasso and adaptive MM-Lasso 
estimators can be calcnlated for p > n. The study of the asymptotic prop¬ 
erties of the these estimators for regression models with a diverging number 
of parameters is part of our future work. 

APPENDIX A: APPENDIX 

Proof of Theorem 1. Take C C {1,2, ...,n} such that #C = n-1 and 
a sequence {:K^^,yNi)Nm, such that {xJj^.,yN,i) = (xf,yi) for f 0 C and all 

N £ N. Let /3g and denote the estimators f3^ and /3^ compnted in 
(x^-, PArOA^eN- Since there are a finite number of sets included in {1, ...,n}, 

to prove the theorem it will be enough to show that {Pb)n is bounded. 
Snppose that this is not so, then eventually passing to a snbsequence we can 
assnme that \\(3b || — S' oo when N —oo. Since pi is bounded, for sufficiently 
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large N we have that 




\Sn(r(/3i ))/ 


+ ^nW^B\\q > ^ Pi 

i=l 


i^) 

\Sn(r(/3i ))/ 


+ ^nllollg 


which contradicts the definition of f3^. 


□ 


Proof of Theorem 2. Let m = nFBP { j 32 ). Take C C {1,2, ...,n} 
such that < m and a sequence (x^-,yArj)ArgH) such that = 

(xf, Ui) for i 0 C and all G N. Let (3 ^, (32 and f3i denote the estimators 

(3^, P 2 and (3^ computed in (x^^, y7Vi)AGN- Note that since ((C < m, (32 
is bounded. Since there are a finite number of sets included in {1, ...,n}, to 

prove the theorem it will be enough to show that {(3j^)n is bounded. Sup¬ 
pose that this is not so, then eventually passing to a subsequence we can 
assume that for some jo, °° when N -A 00 . Hence, there exists 

No, such that for N > No, (3^jo 0. It follows that l/3yijol*/l'^^jo I"' 

Since p\ is bounded, for sufiiciently large N we have that 


tr Vs„(r(/3i))/ 


-|- L, 


1 / 3 . 


1 / 3 : 


'A,j\ 


2,il 


> ^Pl 

2=1 


\Sn(r(/3i ))/ 


p 

+ tn ^ 
2=1 


|0|^ 


which contradicts the definition of (3^ . 


□ 


Define for /3 G s{(3) by 

and let 

"'"'-'■fw) 

It can be readily verified that s(/3) is continuous and positive. Lemma 4.2 
of Yohai and Zamar (1986) shows that s(/3) has a unique minimum at /3 = 
(3o, and hence proves the Fisher consistency of S-estimators of regression. 
Theorem 6 of Fasano et al. (2012), shows that g{(3) has a unique minimum 
at (3 = (3o, and hence proves the Fisher consistency of MM-estimators of 
regression. 

The following Lemma, which appears in Yohai and Zamar (1986) as Lemma 
4.5, is a key result. 
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Lemma 1. Let {xj^yi), i = 1, he i.i.d observations with distribution 
Hq, which satisfies (2). Assume [B1]-[B3] hold. Let K he a compact 

set. Then 

sup |sn(r(/3)) - s(/3)| 0 

I3&K 

To ease notation, we will henceforth note Sn = s„(r(/3^)), where (3i is as 
in (10). 


Proof of Theorem 3. We hrst prove (i). Let 

Zi{/3) = 4(r(/3)) + ^ll/3||J, 

n 

so that argmin^gKP Z^{f3) = (5pg. To prove (i), it suffices to show that 

(14) (3pg is bounded with probability 1 
and that given a compact set K, we have that 

(15) sup \Zl{(3) - s^(/3)| 0. 

I3GK 


Theorem 4.1 of Yohai and Zamar (1988) shows that f3g converges almost 
surely to /3 q and so (14) follows from (9). Note that the second term in Z^ 
converges uniformly to zero over compact sets, and hence Lemma 1 and the 
continuity of s(/3) show that (15) holds. Thus (3) is proved. 

Next, we prove (iii). The proof of (ii) is essentially the same, and is thus 
omitted. 

Note that by definition of /3^ 


(16) 



2 = 1 


f yi-^lPA \ 

\ Sn ) 



2=1 




The second term in (16) is 


s 



i=i 




since f32 is consistent by assumption. Since /3^ is consistent by assumption, 
by Lemma 1, Sn 'S(/3o)- Thus, the Law of large numbers and the Bounded 
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convergence theorem imply that the right hand side of (16) converges almost 
surely to 

u 


b* = Ef,Pi 


Hence, 


1 


lim sup — 


Pi 


s{(3o)J ■ 

yi-xJPA\ 


< b* a.s.. 


i=l 


One can easily show that the graphs of the family of functions 

' y — x^b\ 


n = {pi 


■) 


: b G s > 0 ^ , 


form a VC class of sets with a constant envelope. The proof of this is es¬ 
sentially the same as the one that appears on page 29 of Pollard (1984). It 
follows that "H is a Glivenko-Cantelli class of functions, i.e. 


(17) 


, 1 

sup I — 
beRP, s>0 ^ 


Ep. 




0 . 


Hence, it follows from (17) and Theorem 6 of Fasano et al. (2012) that for 
any e > 0 

li„, i„f 

£<II/3-/3o|| n ^ \ Sn J 

It must be that 

Pa “ 4 - Po- 


□ 


Proof of Theorem 4. We prove (ii), the proof of (i) is essentially the 
same, but replacing for An and taking = 0. 

Let 



so that argmin/ 3 gRp Z^{(3) = /3^. 
Note that 
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A second order Taylor expansion shows that 


1 




Pi 


riiPA) 


71 ^ 




i=l 


^ U 1/^2,il 


n 


E 


Pi 


i=l 


Ui \ {(3 a - (3oV 


1 1 


nSn 

n 


i=l 


Ui 


X,; 


+5^(/3.-/3„r -E^; 


, (ui- Cixf {(3 a - Po) 


xix^ {Pa - Po) 


i=l 


+—y 2 

n ^ 


\PA,j 

'j^l 


with 0 < Ci ^ 1- By Lemma 1, Sn ^(/Sq), and hence by Lemma 4.2 of 
Yohai(1985) 


1^^/ ^i- Ci^J{PA- Po) 


T a.s. _ ,, 

XiXi ^ EfoPi 


u 


s{Po) 


V. 


Then 




1 (f^A - Po) 


2 = 1 


^i^Ij (Pa- Po) 
— CnWPA ~ PoW ) 


where Cn cq > 0. 

We also have that by Lemma 5.1 of Yohai (1985) and the Central Limit 
Theorem 


Bn = 




Put 


Cn = ^E 

71 


|/3aj|* \Po,j\ 


^ 1/^2,il^ |/?2j|^ 


Then, since Pa is strongly consistent for /3 q and the first s coordinates 
of Pq are non zero, for large enough n the first s coordinates of Pa stay 
away from zero with arbitrarily high probability. Applying the Mean Value 
Theorem we get that 


C..P-E 

71 < ^ 


l/3ojr t, 


^ 1/^2,, 1^ \P2J 


I - (3o,j) 
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for some 9j such that \6j — /3o,j| < \0j — Since in = 0{y/ri) and and 
(32 are consistent, we have that for some M > 0, for large enough n, with 
arbitrarily high probability 

Cn > —^||/ 3 a ~ PqW- 

y n 

Then 

ZICPa) - Zli^o) = A - ^(3a - (^o)^Bn + Cn 

y/n 

> c„||(3a - /3o)f - ^II(3 a - Po)\\\\Bn\\ + Cn 

\/n 

= ^||(3a - /3o)||(c.V^||(3a - /3o)ll - ll^nll 

y/n 

+ V^/0A-MCn). 


Now, since - Z^^q) < 0, we have that. 


v^ii3a -3oii < 


\Br, 


Vn/0A- f3o\\Cn 


But 

V^/\0A-MCn>-M. 

Hence, v^||3a “ M = Op{l). 


□ 


Proof of Theorem 5. We prove (ii). The proof of (ii) is essentially the 
same, replacing in by and taking ? = 0. 

We follow Lemma 2 of Huang et al. (2008). Since by Theorem 4 (3a is 
v^-consistent, for a sufficiently large C > 0 and n, ||/3a “/^oll — Cjy/n with 
arbitrarily high probability. 

Let 


14(ui,U2) = 

i=l 


riiPoj + Ul/y/ra, Pqjj + U 2 /y/n) \ 

Sn J 


+ 



\Po,j +Ui,j/Vn\^ 
|32j|^ 


V 

+ E 

i=s+i 



Then for large enough n, with arbitrarily high probability, {PapPaii) 
is obtained by minimizing 14 ,(ui,U 2 ) over ||ui IP + ||u 2 |p < C^. We will 
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show that if ||ui|p + ||u 2 |p < and ||u 2 || > 0 then, for large enough 
n, V"n(ui,U 2 ) — V^(ui,0p_s) > 0 with arbitrarily high probability and the 
theorem will follow. 

It is easy to see that 


2 = 1 


ri{Po,i + ui/y/n, U2/y/n)\ 

- - Pi 

Sn / 


I4(ui,U 2) - ■I4(ui,0p_s) = 
ri{Po,i + Ui/v^n, Op.^) 


Applying the Mean Value Theorem we get 


+ 


P I \t 

^r). X ^ si 


r,t/2 


jpt/2 ISod"' 


(/) + (//). 


{!) = 


1 -1 


(0^,U2)'^^— '^i^l 
y/n Sn ^ 

2=1 


nK 


where 0* = {Poj + ui/y/n, (1 — a„)u 2 /-\/n) for some an G [0,1]. Applying 
the Mean Value Theorem once more we get 


1 -1 


(05,U2)^^- 

y/lT- Sn ^ 
2=1 



Xj = 


1 -1 
y/n Sn 


( Os , U 2)'^^'01 

2=1 



XjT 


1 1 
y/Esl 


(0s,U2)^ 

2=1 


r^ie: 


(ui/y/n, (1 - an)u 2 /y/n) = 


1 -1 




1 1 
nsl 


( Os , U 2)'^^'01 

2=1 



an)u2), 


where ||0** — /3o|| < ||0Ji — /3oll- By Lemma 5.1 of Yohai (1985) and Lemma 
1, the first term in the last equation is Op (||u 2 ||). The second term is also 
Op(||u 2 ||), by Lemma 1, the fact that by [Bl] ■0^ is bounded, [B4] and the 
Law of large numbers. 

On the other hand 


P I \t 

s| 




j = S+l 


l/52,il 


V 

E 

j=s+i 




\y/nl3‘ 


U2 


2 J I 
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since /32 is y^-consistent by assumption. Note also that ||u 2 ||* > ||u 2 ||^ 
Hence, for some Mi, M 2 > 0, for sufficiently large n that does not depend 
on U 2 , with arbitrarily high probability, we have that 

14(ui,U 2) - 14(ui,0p_5) > -Mi||u 2|| + M2inn^‘^"*)/^||U2||* 

(18) = ||u 2 ||*(-Mi||u 2 f-* + M2i„n(^-*)/2). 

Finally, since by assumption —>• 00 , we have that for any se¬ 

quence of non-zero U 2 ,n the right hand side of (18) is strictly positive for 
sufficiently large re. 

□ 


Proof of Theorem 6. We prove (ii). The proof of (i) is essentially the 
same, replacing by and taking ? = 0. 

For 0 G M® let p'(0) = i X) Note that by Theorem 

^ i=i 

3, f3j^ is strongly consistent for /3 q and hence with probability 1 all the 
coordinates of j3j^ j stay away from zero for a sufficiently large re. Also, by 
Theorem 5, (3^^ jj = Op_s with probability tending to one. Then for large 
enough re, with arbitrarily high probability the partial derivatives for the 
first s coordinates of at (3^ exist, and hence 


^ 1 -1 , 
Os = —^V’l 

\ln Sn ^ 

1 1 ” 

= ^-E* 

2 = 1 




Vi - 


^i,I + ^P'iPA,l) 

y/n 


Then the Mean Value Theorem gives 


^ 1 -1 ^ , 
o.s = —— 


y/n Sn 
^ 2=1 




1 


2 '^n\/n{(3j^ j — Pq j) + —=p'(/3^/), 
Sn y/n 


where 




n 


T 


2=1 


and ||0* — PqjW < \\^A,i ~ Po,i\\- 
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Then 


1 


Vn(0A,I - f^oj) = ^ 


'Vi-^llPo/ 


Xi,l 


- S. 


n 


i=l 

w-^p'(3aj) 


By Lemma 1, Sn s{Pq). By Lemma 5.1 of Yohai (1985) and the Central 
Limit Theorem 


1 




'yi-xfjfBoi'' 


x,,7 4iV,(0,a(V’i,To)Vx,) 


2 = 1 


By Lemma 4.2 of Yohai (1985) and Lemma 1 

“4-6(V^i,Fo)V,,. 

Since in!\pn —)• 0, the theorem follows from Sluztky’s Theorem. 


□ 


Proof of Theorem 7. We dehne for z e 



p 

+ -^n(^^ \Po,j + ~ 14,j 4) 

1=1 


so that argmin(i?„) = — /3q). We will show that for each compact 

set K C Ml’, Rn converges weakly to R in To do so, we will verify 

conditions (i) and (ii) of Theorem 2.3 of Kim and Pollard (1990). 

We first prove condition (i): finite-dimensional convergence of Rn to R. A 
second order Taylor expansion shows that 


(19) 


—z 


T 


1 1 

Sn Vn 


^4 

2=1 



Eo. 


n{(3o + z/y/n) 
Sn 


-pi 



1 1 ^1 


2 s± 


n 


Ev-'i 


2 = 1 



T 

Z, 


with 0 < ^2 ^ 1- 
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It can be easily verified that for q > 1 

p p 

(20) \Po,j + ZjlVn\'^ - l/^OjD ^ Aog^2;jS5n(/3oj)|/3ojl'^"^ 

j=l i=l 

uniformly over compact sets, whereas for q = 1 


( 21 ) 


p 


^n (y ^ |/5o,j “I" l/^0,j|) 

i=i 


p 

i=l 


+\zj\I{l3oj — 0 )), 


uniformly over compact sets. 

Then the finite-dimensional convergence follows from (19), (20), (21), 
Lemma 1, Lemmas 4.2 and 5.1 of Yohai (1985), Slutzky’s Theorem and 
the Cramer-Wold device. See the proof of Theorem 6 for more details. 

We now turn to proving condition (ii) of Theorem 2.3 of Kim and Pollard 
(1990), the stochastic equicontinuity of Rn- Fix e,r] and M > 0. Let ||z|| < 
M, llz'll < M. 

A second order Taylor expansion shows that 


^ ^ / r-AAo + z/yTl) \ 


1=1 


Pi 


ri{l3Q + z! / y/n) 




1 1 


2 s± n 


i=l 


— [Z — Z 

' y/n 
AT _ n _ 


Ui - xf z! jy/n) 


Xj-L 


, f Ui- Cixf z/y/n- {1- Ci)xf z'/^/n 


Xixf(z -z'). 


with 0 < Ci < 1. Applying the Mean Value Theorem to the first term in the 
Taylor expansion we get 


— z — z 


J\T 


1 1 


Sn \ n 




2=1 


T / 
Ui - X, Z 


VS)x, 


— z — z 


T-A E* (-)-.+ 

5n \/n ^ ^ f 


2=1 


si n ^ 


2=1 


Ui — Kixjz' u/n 


JTj 
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with 0 < Kj < 1. Then if ||z — z'|| < 6, by Lemma 1 and Lemmas 4.2 and 
5.1 of Yohai (1985) and the fact that by [Bl] is bounded, we have 
( 22 ) 




niPo + z/^/n) 


-pi 


r^(/3o + z7v/n) 


< 50p{l) + 6^0p{l). 


Let P* stand for outer probability. Then it follows from (22) that for suffi¬ 
ciently small 5 


lim sup P* sup I 


^^ / ri{pQ + z/^/n) \ 


2 = 1 


Pi 


riiPo + z'/ ^/n) 



< e, 


where the supremum runs over ||z|| < M, ||z7 < M, ||z — z'|| < <5. Recalling 
(20) and (21) we see that we have proven condition (ii). 

Since by Theorem 4, argmin(R„) = — /3q) = Op(l), the theorem 

follows from Theorem 2.7 of Kim and Pollard (1990). □ 


Proof of Theorem 8. After noting that 

p p 

^ Ao \Zj\n{Poj = 0) 

j=l i=l 

uniformly over compact sets, the proof follows along the same lines as the 
proof of Theorem 7. □ 
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